R Markdown

This project used a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016.

The Olympic data is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed ‘statistorians’. Randi H Griffin scraped this data from the webpage and consolidated their work into a convenient format for data analysis.

Please check out their blog for more information: <www.sports-reference.com>

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   Height = col_double(),
##   Weight = col_double(),
##   Team = col_character(),
##   NOC = col_character(),
##   Games = col_character(),
##   Year = col_double(),
##   Season = col_character(),
##   City = col_character(),
##   Sport = col_character(),
##   Event = col_character(),
##   Medal = col_character()
## )
## Parsed with column specification:
## cols(
##   NOC = col_character(),
##   region = col_character(),
##   notes = col_character()
## )

sport

number<-athlete %>% 
  select(Year,Name,Sex,Height,Weight,Season,Sport) %>% 
  filter(Season=="Summer") %>% 
  filter(Sex=="Male") %>%
  group_by(Sport) %>% 
  summarise(number=n()) 


sport<-athlete %>% 
  select(Year,Name,Sex,Height,Weight,Season,Sport) %>% 
  filter(Season=="Summer") %>% 
  filter(Sex=="Male") %>% 
  unique(by=c(Year,Name)) %>% 
  na.omit() %>% 
  group_by(Sport) %>% 
  summarise(mweight=mean(Weight),mheight=mean(Height)) %>% 
  left_join(number,by="Sport") 

sport$weight<-round(sport$mweight,2)
sport$height<-round(sport$mheight,2)

p <- plot_ly(sport, 
             x = ~mweight, 
             y = ~mheight, 
             color = ~Sport,
             colors='Paired',
             size = ~number,
             type = 'scatter',
             mode = 'markers', 
        marker = list(symbol = 'circle', sizemode = 'diameter',
                      line = list(width = 2, color = '#FFFFFF')),
        text = ~paste('Sport:', Sport, '<br> Mean Height:', 
                      round(mheight,2),'<br>Mean Weight:', round(mweight,2),
                      '<br> Number of Athletes:', number)) %>%
  layout(title = 'Mean Weight vs. Mean Height,1986-2016',
         xaxis = list(title = 'Mean Weight',
                      gridcolor = 'rgb(255, 255, 255)',
                      range = c(60,95),
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwidth = 2),
         yaxis = list(title = 'Mean Height',
                      gridcolor = 'rgb(255, 255, 255)',
                      range = c(160,200),
                      zerolinewidth = 1,
                      ticklen = 5,
                      gridwith = 2),
         paper_bgcolor = 'rgb(243, 243, 243)',
         plot_bgcolor = 'rgb(243, 243, 243)') %>% 
         layout(showlegend = FALSE)

country

Gcountry<-athlete %>% 
      select(Team,NOC,Year,Season,Sport,Games,Event,Medal) %>% 
      filter(Season=="Summer") %>% 
      filter(Medal=="Gold") %>% 
      unique(by=c("NOC","Year","Event")) %>% 
      group_by(Year,NOC) %>% 
      summarise(nmedal=n()) 
l <- list(color = toRGB("grey"), width = 0.5)  ### light grey boundaries 
    
    
g <- list(
      showframe = FALSE,
      showcoastlines = FALSE,
      projection = list(type = 'Mercator')
    )     ### specify map projection/options
 
   
p <- plot_geo(Gcountry) %>%
      add_trace(
        z = ~nmedal, 
        color = ~nmedal,
        colors = 'Blues',
        frame = ~Year,
        locations = ~NOC,
        marker = list(line = l)
      ) %>%
      colorbar(title = paste("Number of gold Medal")) %>%
      layout(title ='Evolution of the Olympics over time', 
             geo = g)
    
p

From the plot,we can see that at first there are only a few large countries participated in the Olympic Games,such as USA and AUS. As times goes by,more and more countries in Asia and Africa also participated in the Olympic Games.

Benford Analysis

height<-athlete %>% 
  select(Name,Sex,Height) %>%
  unique() %>% 
  na.omit()

fheight<-height %>% 
  filter(Sex=="Female")



bfd.gdp <- benford(fheight$Height,1)
plot(bfd.gdp)

From the first plot, we can see that the original data is in blue and the expected frequency according to Benford’s law is in red. We can notice that the data don’t follow Benford Law at all. And the reasons seems obvious- the vast majority of first digit for people’s heights is 1, and there is only a few people that are above 2 meters tall. So in no means can the height of athletes follow the Benford Distribution and Benford Law isn’t a prove that there is some obvious detected errors in data downloads,data generated by statistical procedures, and the inaccurate ordering of data.

Instead, I take a look at whether the data follows the normal distribution.

# for women athlete
p<-ggplot(height,aes(x=Height,color=Sex))+
  geom_density(size=1.2,adjust=1.5)+
  geom_histogram(aes(y=..density..,fill=Sex), alpha=0.1, 
                position="identity",bins=90)+
  theme_minimal()
ggplotly(p, width = 5)
fheight<-height %>% 
  filter(Sex=="Female")

qqnorm(fheight$Height);qqline(fheight$Height, col = 2)

From the qqplot we can see that in general, the heights of female athletes follow the normal distribution, despite the tail is a little bit heavier than normal distributin. It may be because athletes such as basketball players and baseball players who have a larger height accounts for a higher proportion than in the normal people group. Weightlifters and gymnasts can also explains this phenomenon in the same way.